Only run kube_monitor threads on master MiqServer #21532
Conversation
ensure_kube_monitors_started
cleanup_orphaned_worker_rows
sync_deployment_settings
def sync_workers
  # Before syncing the workers check for any orphaned worker rows that don't have
  # a current pod and delete them
  cleanup_orphaned_worker_rows
Wondering if this belongs in the cleanup_failed_workers section
We may have to ensure that the master server goes first in the list...otherwise if a non-master server goes first it will try to compare to an empty "global" list.
Either that, or we spawn off the kubernetes watches as a completely separate thing beforehand, and then piggyback on the master server only to ensure the thread is still alive or not.
It's unfortunate podified does this thing where it creates servers for zones. Anything that hits the miq_workers table is scoped for the server of the zone that it's currently processing, but anything that looks at current_pods is global and needs the check to make sure only the primary server does it.
It does make sense to put this with delete_failed_deployments in cleanup_failed_workers as long as the timing works, but the latter is global so it has the primary? check, whereas this is per-server and doesn't. It's similar behavior at a different scope.
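To make the scoping distinction above concrete, here is a hypothetical, self-contained sketch (all names are illustrative stand-ins, not the real MiqServer code): the pod list is global to the cluster, while worker rows are per-server, so only the primary server should act on the comparison.

```ruby
# Hypothetical sketch: current_pods is global (one Kubernetes cluster), while
# worker rows are per-server, so orphan cleanup is guarded to the primary.
WorkerRow = Struct.new(:system_uid)

def cleanup_orphaned_worker_rows(worker_rows, current_pods, primary:)
  # A non-primary ("fake" zone) server sees the same global pod list, so only
  # the primary should delete rows; everyone else leaves the rows untouched.
  return worker_rows unless primary

  # Keep only rows whose pod still exists; the rest are orphans to delete.
  worker_rows.select { |row| current_pods.key?(row.system_uid) }
end

rows = [WorkerRow.new("1-generic-abc"), WorkerRow.new("1-ui-old")]
pods = { "1-generic-abc" => {} }

cleanup_orphaned_worker_rows(rows, pods, primary: false).size # => 2 (no-op)
cleanup_orphaned_worker_rows(rows, pods, primary: true).size  # => 1
```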
cleanup_orphaned_worker_rows

# Update worker deployments with updated settings such as cpu/memory limits
sync_deployment_settings
TODO this seems like something that sync_workers should handle when looping through the worker classes
Yes, although we'd be trading guard clauses. Currently we loop through podified_miq_workers and skip duplicate deployment names we already processed:
next if checked_deployments.include?(worker.worker_deployment_name)
Instead, we'd have to skip worker classes that have no existing podified_miq_workers / deployment.
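A hypothetical sketch of the current loop shape being described (the row names are stand-ins for the real podified worker rows): each deployment name is processed only once.

```ruby
require "set"

# Hypothetical sketch: loop over worker rows, skipping deployment names we
# already processed, so settings are applied once per deployment.
WorkerRow = Struct.new(:worker_deployment_name)

rows = [WorkerRow.new("generic"), WorkerRow.new("generic"), WorkerRow.new("ui")]

checked_deployments = Set.new
rows.each do |worker|
  # Skip duplicate deployment names we already processed
  next if checked_deployments.include?(worker.worker_deployment_name)

  checked_deployments << worker.worker_deployment_name
  # ...compare/apply deployment settings here, once per deployment...
end

checked_deployments.to_a # => ["generic", "ui"]
```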
Either way, it's easier now, whereas previously it was just messy to have sync_workers have podified? checks plus introspect the server's current_pods and current_deployments.
@@ -131,6 +144,8 @@ def ensure_kube_monitors_started
  end

  def delete_failed_deployments
    return unless my_server.is_master?

    failed_deployments.each do |failed|
failed_deployments has "global" scope (aka not scoped to a specific miq_server) so this should also be only run on the master server
Force-pushed from 70789e1 to 59abe1f
On podified there are multiple miq_server instances but only one "master". All miq_servers run on the same kubernetes cluster and thus only have to run one set of monitor_pods/monitor_deployments threads to update the global current_pods/current_deployments hashes.
Force-pushed from 59abe1f to 91a9230
Does it make sense to go all the way and extract the kubernetes monitoring into its own objects? Like …
- ensure_kube_monitors_started
+ # All miq_server instances have to reside on the same Kubernetes cluster, so
+ # we only have to sync the list of pods and deployments once
+ ensure_kube_monitors_started if my_server.is_master?
Hm, so just is_master? won't do it because that is only set after initial server startup. We might have to check if my_zone is default (and make sure that server is first).
I'm thinking maybe we encapsulate that my_zone == "default" logic into a method.
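A minimal, hypothetical sketch of that suggestion (the class and accessor names here are stand-ins, not the actual ManageIQ implementation): wrap the string comparison in a named predicate so callers express intent instead of repeating the check.

```ruby
# Hypothetical sketch: encapsulate the my_zone == "default" check in a named
# predicate instead of scattering string comparisons through the monitor loop.
class ServerMonitor
  def initialize(zone)
    @zone = zone
  end

  def my_zone
    @zone
  end

  # One named place for the "am I the real/primary server?" decision.
  def my_server_is_primary?
    my_zone == "default"
  end
end

ServerMonitor.new("default").my_server_is_primary? # => true
ServerMonitor.new("zone-a").my_server_is_primary?  # => false
```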
- ensure_kube_monitors_started
+ # All miq_server instances have to reside on the same Kubernetes cluster, so
+ # we only have to sync the list of pods and deployments once
+ ensure_kube_monitors_started if my_server_is_primary?
With this change I think we now need the "default" to be first in the list
Actually, as I'm looking at all of this closer, this is all really weird to me. We should really be starting this thread outside of the as_each_server { monitor } loop and also ensure it's still alive outside of the loop. This whole monitoring kube thing has nothing to do with individual servers. The only thing we really need individual servers for is to know if that server's workers are now orphaned or not. Even then we don't need to do it server by server, because we can just make it check all the workers in_my_region. I'm tempted to say let's just get rid of this whole "looping over the servers" thing.
Oh, good point. I don't think we need anything to be done on each individual server. The code outside of the as_each_server block could easily check the deployments, settings, and pods/rows for the whole region.
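A hypothetical sketch of that region-wide alternative (all names are illustrative stand-ins): instead of looping server by server, compare every worker row in the region against the one global pod list in a single pass.

```ruby
# Hypothetical sketch: find orphaned worker rows region-wide instead of per
# server. current_pods already covers the whole cluster, so no per-server
# scoping is needed.
RegionWorkerRow = Struct.new(:system_uid, :miq_server_id)

def orphaned_workers_in_region(worker_rows, current_pods)
  # One pass over all rows in the region; a row is orphaned when no pod with
  # its system_uid exists anymore.
  worker_rows.reject { |row| current_pods.key?(row.system_uid) }
end

rows = [
  RegionWorkerRow.new("1-generic-abc", 1),
  RegionWorkerRow.new("1-ui-xyz",      2),
  RegionWorkerRow.new("1-api-old",     2)
]
pods = { "1-generic-abc" => {}, "1-ui-xyz" => {} }

orphaned_workers_in_region(rows, pods).map(&:system_uid) # => ["1-api-old"]
```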
Yeah, we're definitely overloading the miq_server concept here; it doesn't help that we're creating multiple miq_servers for a single "platform" instance (aka before, 1 miq_server <=> 1 vm).
I think we need to decide what exactly an miq_server is: does it track 1:1 with a runtime platform (vm, k8s namespace), or is it just a logical grouping for zones and other resources?
Okay, short term I've unshifted the "real" miq_server to the front of the line. Long term we definitely need to either extract this out of miq_server or not create "fake" miq_servers for zones.
Not sure if there is enough here to be considered short-term; maybe we just jump straight into reworking this.
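The short-term "unshift the real server first" workaround can be sketched like this (a hypothetical, self-contained illustration; the server class and names are stand-ins): the master server is moved to the front so it runs first and populates the global lists before the zone servers compare against them.

```ruby
# Hypothetical sketch: move the master server to the front of the list so it
# populates the global current_pods/current_deployments before the others run.
FakeServer = Struct.new(:name, :master) do
  def is_master?
    master
  end
end

servers = [
  FakeServer.new("zone-a-server", false),
  FakeServer.new("real-server",   true),
  FakeServer.new("zone-b-server", false)
]

# Array#delete returns the removed element; unshift puts it at index 0.
primary = servers.detect(&:is_master?)
servers.unshift(servers.delete(primary)) if primary

servers.map(&:name) # => ["real-server", "zone-a-server", "zone-b-server"]
```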
Yeah, I have a solid idea in my head on that - I'll rip out a quick WIP PR once this is merged to discuss.
Checked commits agrare/manageiq@91a9230~...3773816 with ruby 2.6.3, rubocop 1.13.0, haml-lint 0.35.0, and yamllint
LGTM - want @jrafanie to also approve before merge.
LGTM
Follow-up to #21525